Welcome to the MapReduce tutorials. The intent of these tutorials is to provide a good understanding of MapReduce. In addition to these tutorials, we will also cover common issues, interview questions, and how-to's of MapReduce.
MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:
-Analyzing application logs
-Aggregating related data from external sources
-Transforming data from one format to another
-Exporting data for external analysis
A MapReduce job is built from two components:
-Mapper: Consumes input data, analyzes it (usually with filtering and sorting operations), and emits tuples (key-value pairs).
-Reducer: Consumes the tuples emitted by the Mapper and performs a summary operation that produces a smaller, combined result from the Mapper's data.
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale.
Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If in a mapping task you change an input (key, value) pair, it does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs which are then forwarded by the Hadoop system into the next phase of execution.
The "trick" behind the following Python code is that we will use the Hadoop Streaming API (see also the corresponding wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python's sys.stdin to read input data and print our own output to sys.stdout. That's all we need to do, because Hadoop Streaming will take care of everything else!
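As a sketch (the names mapper.py and reducer.py are illustrative, and this is a minimal reconstruction rather than the article's original listing), the two streaming scripts for word count could look like this:

mapper.py:

#!/usr/bin/env python3
# Read lines from STDIN and emit "word<TAB>1" pairs on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py:

#!/usr/bin/env python3
# Read "word<TAB>count" pairs from STDIN, already sorted by word
# (Hadoop's shuffle/sort guarantees this), and sum the counts per word.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word is not None:
    print(f"{current_word}\t{current_count}")

You can test the pair locally with an ordinary pipe, for example cat input.txt | python3 mapper.py | sort | python3 reducer.py, before submitting them through the Hadoop Streaming jar with the -mapper, -reducer, -input, and -output options (the exact jar path depends on your installation).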
MapReduce can also be applied to many other problems; for example, Google uses MapReduce to calculate its PageRank. As a motivating example, consider the word count problem:
-We have a large file of words, one word to a line.
-We want to count the number of times each distinct word appears in the file.
-Sample application: analyzing web server logs to find popular URLs.
On a single machine there are three cases to consider:
-Case 1: the entire file fits in memory.
-Case 2: the file is too large for memory, but all <word, count> pairs fit in memory.
-Case 3: the file is on disk and there are too many distinct words to fit in memory; here a pipeline such as sort datafile | uniq -c does the job.
To make it slightly harder, suppose we have a large corpus of documents and want to count the number of times each distinct word occurs in the corpus: words(docs/*) | sort | uniq -c, where words takes a file and outputs the words in it, one to a line. This pipeline captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
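For comparison, here is a minimal single-machine sketch of the same word count in plain Python (roughly Case 2 above, where all <word, count> pairs fit in memory); the script name wordcount_local.py and the command-line argument are illustrative:

# wordcount_local.py - single-machine word count; assumes all
# <word, count> pairs fit in memory (Case 2 above).
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        counts.update(line.split())

# Print counts in "count<TAB>word" form, most frequent first,
# similar in spirit to the output of sort | uniq -c.
for word, count in counts.most_common():
    print(f"{count}\t{word}")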
In MapReduce pseudocode, word count looks like this:

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
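To make the shuffle (group-by-key) step between these two functions explicit, here is a minimal local simulation in Python; it only illustrates the data flow on a toy input and is not how Hadoop actually executes a job:

# Simulate map -> shuffle (sort/group by key) -> reduce for word count.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # documents: iterable of (name, text) pairs; emit (word, 1) tuples.
    for _name, text in documents:
        for word in text.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sorting the pairs by key mirrors Hadoop's shuffle/sort phase.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _word, count in group))

docs = [("doc1", "the cat sat"), ("doc2", "the cat ran")]
for word, total in reduce_phase(map_phase(docs)):
    print(word, total)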
MapReduce is a programming framework that abstracts away the complexity of parallel applications. The management architecture is based on the master/worker model, while slave-to-slave data exchange follows a peer-to-peer (P2P) model. The simplified computational model for data handling means the programmer does not need to deal with the complexity of data distribution and management. That complexity is considerable, because of the large number of data sets scattered across hundreds or thousands of machines and the need to keep computing runtime low. The MapReduce architecture consists of a master machine that manages the other worker machines.
Hadoop provides several built-in Writable data types for keys and values, for example:
-Text (stores String data)
-LongWritable
-FloatWritable
-IntWritable
-BooleanWritable
A custom data type can also be created by implementing the Writable interface. Hadoop is capable of transmitting any custom data type (whichever fits your requirement) that implements Writable. The Writable interface has two methods, readFields and write. The first method (readFields) initializes the fields of the object from the data contained in the 'in' binary stream (deserialization). The second method (write) writes the fields of the object to the binary stream 'out' (serialization). The most important contract of the entire process is that fields are written and read in the same order.

The Writable interface:

public interface Writable {
  void readFields(DataInput in) throws IOException;
  void write(DataOutput out) throws IOException;
}

Hadoop offers several advantages:
-Distribute data and computation: keeping computation local to the data prevents network overload.
-Linear scaling in the ideal case: Hadoop is designed to run on cheap, commodity hardware.
-Simple programming model: the end-user programmer only writes map and reduce tasks.
HDFS (the Hadoop Distributed File System) is designed around the following goals and characteristics:
-HDFS stores large amounts of information.
-HDFS has a simple and robust coherency model.
-That is, it should store data reliably.
-HDFS is scalable and provides fast access to this information; it is also possible to serve a large number of clients by simply adding more machines to the cluster.
-HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
-HDFS provides streaming read performance.
-Data is written to HDFS once and then read several times.
-The overhead of caching is great enough that data should simply be re-read from the HDFS source.
-Fault tolerance by detecting faults and applying quick, automatic recovery
-Processing logic close to the data, rather than the data close to the processing logic
-Portability across heterogeneous commodity hardware and operating systems
-Economy by distributing data and processing across clusters of commodity personal computers
-Efficiency by distributing data and logic to process it in parallel on nodes where data is located
-Reliability by automatically maintaining multiple copies of data and automatically deploying processing logic in the event of failures
-HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity.
Hadoop MapReduce and HDFS also have some limitations:
-Rough edges: both are still somewhat rough, because the software is under active development.
-The programming model is very restrictive: the lack of central data can be limiting.
-Joins of multiple datasets are tricky and slow: there are no indices, and often the entire dataset gets copied in the process.
-Cluster management is hard: operations like debugging, distributing software, and collecting logs are difficult.
-There is still a single master, which requires care and may limit scaling.
-Managing job flow isn't trivial when intermediate data must be kept.
-The optimal configuration of nodes is not obvious, e.g. the number of mappers, the number of reducers, and memory limits.
In this discussion we have covered the most important Hadoop MapReduce features. These features are helpful for customization purposes. In practical MapReduce applications, the default implementations of the APIs are rarely used as-is; rather, the custom features built on the exposed APIs have a significant impact. All these customizations can be done easily once the concepts are clear.